This report explores a dataset of 4,898 white wines, each described by 11 input variables that quantify its chemical properties and 1 output variable, quality.
## [1] 4898 13
The rating column gives an overall rating of each wine, derived from its quality score according to the following scale:
poor: quality <= 4
average: 5 <= quality <= 6
good: quality >= 7
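The scale above can be sketched with cut(). This is only an illustration, not necessarily how the rating column was built: it rebuilds a quality vector from the counts reported in this document, and the names quality_counts and rating_counts are assumptions.

```r
# Rebuild a quality vector from the per-score counts reported in this document.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# cut() uses right-closed intervals by default:
# (-Inf, 4] -> poor, (4, 6] -> average, (6, Inf) -> good.
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("poor", "average", "good"),
              ordered_result = TRUE)

rating_counts <- table(rating)
print(rating_counts)
```

The resulting counts can be compared with the rating column of the summary output to confirm that "average" covers the scores 5 and 6.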
## 'data.frame': 4898 obs. of 14 variables:
## $ Count : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ rating : Ord.factor w/ 3 levels "poor"<"average"<..: 2 2 2 2 2 2 2 2 2 2 ...
## Count fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality rating
## Min. : 8.00 3: 20 poor : 183
## 1st Qu.: 9.50 4: 163 average:3655
## Median :10.40 5:1457 good :1060
## Mean :10.51 6:2198
## 3rd Qu.:11.40 7: 880
## Max. :14.20 8: 175
## 9: 5
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
## poor average good
## 183 3655 1060
The “quality” column is the output variable and is rated from 0 (very bad) to 10 (excellent). The histogram shows that no white wine in the data was rated below 3 or above 9.
Most of the wines have an “average” rating, which makes the “poor” and “good” wines look almost like outliers. This raises a few questions:
How accurate is the data?
Were the wines selected randomly for testing, and how many brands or varieties were involved?
How old were the wines when they were tested?
According to the documentation, each quality score is the median of at least 3 evaluations made by wine experts, so it is not clear which factors were taken into consideration when deciding the quality. We’ll now look at each variable individually to get a better understanding of the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The distribution of fixed.acidity is roughly normal with a slight positive skew (mean = 6.855, median = 6.800). Outliers were removed from the plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
volatile.acidity is positively skewed. Wines with lower values of volatile.acidity are quite common, as higher levels can lead to an unpleasant taste.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
As per the documentation, citric acid is found in small quantities, which explains the low values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The residual.sugar distribution is positively skewed, with a pronounced spike at about 1.5 g/dm^3. The x-axis range has been restricted to exclude the outliers at higher values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Most wines contain a low amount of sodium chloride. Outliers were removed from the plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
A typical wine contains about 34 - 35 mg/dm^3 of free sulfur dioxide (median = 34.00, mean = 35.31). The plot has been re-scaled to exclude the outliers at higher values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
total.sulfur.dioxide is the combination of the free and bound forms of SO2, and follows nearly the same pattern as free.sulfur.dioxide. Outliers were removed from the plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The distribution of density is approximately normal, with median = 0.9937 and mean = 0.9940. Outliers were removed from the plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The plot above shows that most of the wines have a pH between 3 and 3.5. The distribution is almost normal, with median = 3.180 and mean = 3.188.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The plot looks bimodal, with peaks at 0.39 and 0.48.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The distribution is positively skewed. The plot shows that most white wines contain about 9 - 12% alcohol.
The white wine data contains 4898 observations of 14 variables. The 14th variable, “rating”, was added to group the “quality” scores into 3 categories: “poor”, “average” and “good”. As it turns out, most of the wines have an average rating, with a whopping 3655 wines in this category. This could mean that we are working with a sample of a larger data set, that the data is incomplete, or that the testing was done under constraints such as brand or location.
One of my main interests is to find which variable or combination of variables most strongly affects the quality or rating of a wine. I am also interested in how the variables affect each other, for instance how the density of a wine is affected by its alcohol and sugar content.
At this point, I believe features like acidity (fixed and volatile), residual sugar and alcohol content may have a crucial effect on the pH and quality of a wine.
Yes. The “rating” variable was added to the dataframe based on the “quality” scores.
Most of the plots either have a normal distribution or are positively skewed. The alcohol and residual sugar distributions stand out because of their several spikes along the scale, which makes it harder to relate them to wine quality. Further analysis alongside other variables may help explain the pattern.
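The skew of a distribution can be roughly checked from the quartiles alone. The sketch below uses Bowley's quartile skewness, (Q3 + Q1 - 2 * median) / (Q3 - Q1), on quartiles copied from the summaries in this report. The helper name bowley_skew is an assumption, and the measure only describes the middle 50% of the data, so it ignores the long right tails that the maxima point to.

```r
# Bowley (quartile) skewness: 0 for a symmetric middle half,
# positive when the upper quartile lies further from the median.
bowley_skew <- function(q1, med, q3) (q3 + q1 - 2 * med) / (q3 - q1)

# Quartiles copied from the summary output in this report.
quartiles <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "residual.sugar", "alcohol"),
  q1  = c(6.30, 0.21, 1.70, 9.50),
  med = c(6.80, 0.26, 5.20, 10.40),
  q3  = c(7.30, 0.32, 9.90, 11.40)
)
quartiles$skew <- with(quartiles, bowley_skew(q1, med, q3))
print(quartiles)
```

A value near zero for fixed.acidity and positive values for the other three variables would be consistent with the shapes described in the text.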
Apart from adding the new ordered variable “rating” to the dataframe, no adjustments were made to the original data. While building the histograms, outliers were excluded for some of the variables to make the plots more readable.
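One way to exclude outliers from a histogram without touching the data is to cap the plotted range at a high quantile. The sketch below is an assumption about how this could be done, not a record of the ranges actually used in this report. It runs on synthetic right-skewed values, not on the wine data, and the 99% cut-off is an arbitrary choice.

```r
# Synthetic, right-skewed stand-in for a column such as residual.sugar.
# These values are simulated for illustration only.
set.seed(1)
x <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Keep everything up to the 99th percentile for plotting purposes only;
# the original vector x is left unchanged.
upper <- quantile(x, 0.99)
trimmed <- x[x <= upper]

# Compute the histogram bins without drawing, so the sketch runs anywhere.
h <- hist(trimmed, breaks = 30, plot = FALSE)
print(length(x) - length(trimmed))  # number of values left out of the plot
```

Calling plot(h) would draw the trimmed histogram.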
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.28918070
## volatile.acidity -0.02269729 1.00000000 -0.14947181
## citric.acid 0.28918070 -0.14947181 1.00000000
## residual.sugar 0.08902070 0.06428606 0.09421162
## chlorides 0.02308564 0.07051157 0.11436445
## free.sulfur.dioxide -0.04939586 -0.09701194 0.09407722
## total.sulfur.dioxide 0.09106976 0.08926050 0.12113080
## density 0.26533101 0.02711385 0.14950257
## pH -0.42585829 -0.03191537 -0.16374821
## sulphates -0.01714299 -0.03572815 0.06233094
## alcohol -0.12088112 0.06771794 -0.07572873
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## sulphates alcohol
## fixed.acidity -0.01714299 -0.12088112
## volatile.acidity -0.03572815 0.06771794
## citric.acid 0.06233094 -0.07572873
## residual.sugar -0.02666437 -0.45063122
## chlorides 0.01676288 -0.36018871
## free.sulfur.dioxide 0.05921725 -0.25010394
## total.sulfur.dioxide 0.13456237 -0.44889210
## density 0.07449315 -0.78013762
## pH 0.15595150 0.12143210
## sulphates 1.00000000 -0.01743277
## alcohol -0.01743277 1.00000000
## density chlorides volatile.acidity
## -0.307123313 -0.209934411 -0.194722969
## total.sulfur.dioxide fixed.acidity residual.sugar
## -0.174737218 -0.113662831 -0.097576829
## citric.acid free.sulfur.dioxide sulphates
## -0.009209091 0.008158067 0.053677877
## pH alcohol
## 0.099427246 0.435574715
The vector above lists the correlation of each variable with the quality of a wine, sorted from most negative to most positive. free.sulfur.dioxide has the weakest correlation and alcohol the strongest. Apart from alcohol, density, chlorides and volatile.acidity also show a noticeable correlation with quality.
(Articles from http://www.brsquared.org/wine/Articles/SO2/SO2.htm and https://winemakermag.com/501-measuring-residual-sugar-techniques were consulted to get an initial understanding of the various components of wine.)
The following are some of the correlations that stand out, with an absolute coefficient > 0.3:
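Because the sign of a correlation only gives its direction, ranking the variables by absolute value makes the strength of each relationship easier to compare. The sketch below re-orders the coefficients printed above. The name quality_cor is an assumption, and the 0.3 threshold is simply the cut-off used elsewhere in this report.

```r
# Correlations with quality, copied from the output above.
quality_cor <- c(
  density = -0.307123313, chlorides = -0.209934411,
  volatile.acidity = -0.194722969, total.sulfur.dioxide = -0.174737218,
  fixed.acidity = -0.113662831, residual.sugar = -0.097576829,
  citric.acid = -0.009209091, free.sulfur.dioxide = 0.008158067,
  sulphates = 0.053677877, pH = 0.099427246, alcohol = 0.435574715
)

# Rank by strength, ignoring direction.
ranked <- sort(abs(quality_cor), decreasing = TRUE)
print(round(ranked, 3))

# Variables whose absolute correlation with quality exceeds 0.3.
strong <- names(ranked)[ranked > 0.3]
print(strong)
```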
fixed.acidity doesn’t affect the quality much. (cor.coeff = -0.114)
Much like fixed.acidity, volatile.acidity has little effect on quality. (cor.coeff = -0.195)
citric.acid has no effect on the quality of a wine. (cor.coeff = -0.009)
residual.sugar has little to no effect on the quality. (cor.coeff = -0.098)
With decreasing amount of chlorides, there is a subtle improvement in the quality. (cor.coeff = -0.210)
free.sulfur.dioxide has very little effect on the quality. (cor.coeff = 0.008)
total.sulfur.dioxide is negatively correlated with the wine’s quality, even though free.sulfur.dioxide is slightly positively correlated with it. (cor.coeff = -0.175)
This is because the bound form of sulfur dioxide, which makes up the rest of total.sulfur.dioxide, is negatively correlated with the quality. (cor.coeff = -0.218)
As density increases, the perceived quality decreases. Further analysis may be needed to examine exactly how much effect density has on the quality of wine. (cor.coeff = -0.307)
A higher pH is very slightly better for the quality of wine. (cor.coeff = 0.099)
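The bound form is not a column in the data, so it has to be derived as total minus free sulfur dioxide. The sketch below shows that step on the first ten rows printed in the structure listing earlier in this report. The name bound.sulfur.dioxide is an assumption, and the correlation with quality is not recomputed here because that needs the full data set.

```r
# First ten values of each column, copied from the str() output above.
free.sulfur.dioxide  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total.sulfur.dioxide <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Bound SO2 is whatever part of the total is not free.
bound.sulfur.dioxide <- total.sulfur.dioxide - free.sulfur.dioxide
print(bound.sulfur.dioxide)
```

On the full data frame, the same subtraction followed by cor() against the numeric quality scores would give the coefficient quoted above.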
Quality of wine stays quite steady with the varying amount of sulphates. (cor.coeff = 0.054)
This is where we see the strongest association with the quality of wine. Higher alcohol content goes with better-rated wine. (cor.coeff = 0.436)
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = white_wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5317 -0.5286 0.0012 0.4996 3.1579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.582009 0.098008 5.938 3.08e-09 ***
## alcohol 0.313469 0.009258 33.858 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1896
## F-statistic: 1146 on 1 and 4896 DF, p-value: < 2.2e-16
Alcohol alone explains about 19% of the variance in wine quality (R-squared = 0.1897). Below is a more detailed breakdown of alcohol level by quality (density plot).
We’ll now examine a few plots of strongly correlated variables.
This relationship makes sense: as sugar is broken down, more alcohol is produced, hence the negative correlation.
alcohol has the most impact on the wine quality, even though it explains only about 19% of the variance. It was also interesting to analyze the relationships between alcohol and density, alcohol and residual.sugar, and quality and pH, which suggest that better wines have high alcohol content, low density and low residual sugar.
Among quality vs. pH, fixed.acidity, volatile.acidity and citric.acid, the positive correlation with pH and the negative correlations with fixed.acidity and volatile.acidity were interesting to come across, suggesting that good white wines might have lower acidity, or equivalently higher pH values.
The strongest relationship was density vs. residual.sugar, with a correlation coefficient over 0.8.
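Two details of the model above can be checked with a short sketch. First, the structure listing shows quality stored as a factor with levels "3" to "9", so as.numeric(quality) returns the level codes 1 to 7 rather than the scores. The slope is unaffected, but the intercept is on a scale shifted by 2. Second, for a regression with a single predictor, R-squared equals the squared correlation coefficient. The variable names below are illustrative.

```r
# A factor with the same seven levels as the quality column.
quality <- factor(3:9)

codes  <- as.numeric(quality)                # level codes: 1, 2, ..., 7
scores <- as.numeric(as.character(quality))  # original scores: 3, 4, ..., 9
print(scores - codes)                        # constant shift of 2

# Correlation between alcohol and quality, copied from the vector above.
r <- 0.435574715
r_squared <- r^2
print(round(r_squared, 4))
```

Converting with as.numeric(as.character(quality)) before fitting would put the intercept on the original 3 to 9 scale.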
Since alcohol plays one of the major roles in deciding the wine’s quality, we’ll now analyze what affects the alcohol level.
Better quality wines tend to have a slightly higher pH and a noticeably higher alcohol level.
Not much correlation between alcohol and citric.acid.
Higher alcohol and less acidity for producing better wines.
Unlike fixed.acidity, volatile.acidity seems to have a positive effect on the alcohol level. This is a strange phenomenon, since pH has a positive correlation with alcohol and acidity decreases as pH increases. However, more data is required to confirm this conclusion.
With a weak negative correlation of -0.022, more data is needed to determine whether they have any significant effect on each other.
Not much correlation between alcohol and sulphates.
Lower total.sulfur.dioxide for better wine.
Low residual.sugar goes with higher alcohol content and thus better wine. However, residual.sugar by itself does not appear to have much effect on the wine quality.
As per the documentation, density depends on the alcohol level and, as established, alcohol has a high impact on the quality of wine. The plot above makes it evident that higher alcohol leads to a dip in density. It may be that alcohol has an impact on both density and quality, rather than density alone affecting the quality.
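One way to probe this idea with the numbers already reported is a first-order partial correlation, which estimates how quality and density relate once alcohol is held fixed. The sketch below applies the standard formula to the three pairwise correlations printed earlier. It is a rough check under the assumption of linear relationships, and the helper name partial_cor is illustrative.

```r
# Partial correlation of x and y given z, from pairwise correlations:
# (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
partial_cor <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Pairwise correlations copied from the output earlier in this report.
r_quality_density <- -0.307123313
r_quality_alcohol <-  0.435574715
r_density_alcohol <- -0.78013762

pc <- partial_cor(r_quality_density, r_quality_alcohol, r_density_alcohol)
print(round(pc, 3))
```

A partial correlation close to zero would support the reading that the raw negative correlation between density and quality mostly reflects their shared dependence on alcohol.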
##
## Calls:
## m1: lm(formula = I(as.numeric(quality)) ~ I(alcohol), data = white_wine_data)
## m2: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density, data = white_wine_data)
## m3: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density +
## chlorides, data = white_wine_data)
## m4: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density +
## chlorides + volatile.acidity, data = white_wine_data)
## m5: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density +
## chlorides + volatile.acidity + total.sulfur.dioxide, data = white_wine_data)
## m6: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density +
## chlorides + volatile.acidity + total.sulfur.dioxide + fixed.acidity,
## data = white_wine_data)
##
## ============================================================================================================
## m1 m2 m3 m4 m5 m6
## ------------------------------------------------------------------------------------------------------------
## (Intercept) 0.582*** -24.492*** -23.150*** -37.573*** -32.759*** -45.308***
## (0.098) (6.165) (6.162) (6.010) (6.295) (6.493)
## I(alcohol) 0.313*** 0.360*** 0.343*** 0.389*** 0.391*** 0.407***
## (0.009) (0.015) (0.015) (0.015) (0.015) (0.015)
## density 24.728*** 23.671*** 38.217*** 33.251*** 46.423***
## (6.079) (6.074) (5.926) (6.234) (6.458)
## chlorides -2.382*** -1.300* -1.370* -1.383*
## (0.558) (0.542) (0.543) (0.540)
## volatile.acidity -2.043*** -2.070*** -2.108***
## (0.111) (0.111) (0.111)
## total.sulfur.dioxide 0.001* 0.001*
## (0.000) (0.000)
## fixed.acidity -0.099***
## (0.014)
## ------------------------------------------------------------------------------------------------------------
## R-squared 0.190 0.192 0.195 0.248 0.249 0.257
## adj. R-squared 0.190 0.192 0.195 0.247 0.248 0.256
## sigma 0.797 0.796 0.795 0.768 0.768 0.764
## F 1146.395 583.290 396.315 402.956 324.034 281.812
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5831.127 -5822.011 -5657.292 -5654.027 -5627.454
## Deviance 3112.257 3101.773 3090.247 2889.234 2885.385 2854.246
## AIC 11684.782 11670.255 11654.021 11326.584 11322.054 11270.908
## BIC 11704.272 11696.241 11686.504 11365.563 11367.530 11322.880
## N 4898 4898 4898 4898 4898 4898
## ============================================================================================================
##
## Call:
## lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density +
## chlorides + volatile.acidity + total.sulfur.dioxide + fixed.acidity,
## data = white_wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4220 -0.5009 -0.0317 0.4697 3.2309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.531e+01 6.493e+00 -6.978 3.40e-12 ***
## I(alcohol) 4.067e-01 1.508e-02 26.963 < 2e-16 ***
## density 4.642e+01 6.458e+00 7.189 7.51e-13 ***
## chlorides -1.383e+00 5.399e-01 -2.562 0.0104 *
## volatile.acidity -2.108e+00 1.107e-01 -19.040 < 2e-16 ***
## total.sulfur.dioxide 6.805e-04 3.058e-04 2.225 0.0261 *
## fixed.acidity -9.927e-02 1.359e-02 -7.305 3.23e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7639 on 4891 degrees of freedom
## Multiple R-squared: 0.2569, Adjusted R-squared: 0.256
## F-statistic: 281.8 on 6 and 4891 DF, p-value: < 2.2e-16
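The last two lines of the summary are related to R-squared by simple formulas, which can serve as a sanity check when reading such output. The sketch below recomputes the adjusted R-squared and the F-statistic of the six-predictor model from the reported R-squared. It uses the rounded value 0.2569, so small differences in the last digit are expected.

```r
n  <- 4898    # number of observations
p  <- 6       # number of predictors in the model
r2 <- 0.2569  # multiple R-squared, as reported above

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic on p and n - p - 1 degrees of freedom.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 3))
print(round(f_stat, 1))
```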
Higher alcohol content is the factor most clearly associated with better wine quality.
Unlike its negative correlation with fixed.acidity and citric.acid, alcohol has a positive relationship with volatile.acidity.
Low total.sulfur.dioxide is evidently better for wine.
Level of alcohol affects the density and residual.sugar content. But density and residual.sugar don’t seem to have much effect on the quality by themselves.
Yes. The fact that most of the wines in the data set are of average quality becomes a bottleneck in determining how the variables vary with quality. The low R-squared value reflects that alcohol alone explains only about 19% of the variance in quality, although R-squared improves a bit, to about 0.26, when the other variables are factored in.
Alcohol has the strongest influence on the quality of wine. The mean and median in each of the quality boxplots align closely, even for the higher quality wines where we have less data.
After the quality of wine, it’s crucial to highlight the relationships between alcohol and residual.sugar, and between alcohol and density. As sugar is fermented, the alcohol content increases, which in turn affects the density. The plots show not only how strongly these variables are correlated but also that the negative correlations between quality and density, and between quality and residual.sugar, are primarily due to the alcohol level.
With attributes like fixed.acidity, volatile.acidity and citric.acid in the wine, it was interesting to see how pH varies with the quality. After an initial rise in pH (mostly for average and good wines), good quality wines have an almost constant pH as the alcohol level increases.
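How much each added variable contributes can be read off the model comparison table by differencing the R-squared values. The sketch below does this with the values reported for m1 to m6. The names r2, added and gain are assumptions, and the gains depend on the order in which the variables were entered, so they are not a definitive measure of importance.

```r
# R-squared of the nested models m1..m6, copied from the comparison table.
r2 <- c(m1 = 0.190, m2 = 0.192, m3 = 0.195, m4 = 0.248, m5 = 0.249, m6 = 0.257)

# Variable added at each step after m1 (alcohol only).
added <- c("density", "chlorides", "volatile.acidity",
           "total.sulfur.dioxide", "fixed.acidity")

# Increase in R-squared attributable to each added variable, in entry order.
gain <- setNames(diff(r2), added)
print(round(gain, 3))
print(names(which.max(gain)))
```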
I chose the white wine data for my project. After going through the documentation, it was important to look for the factors that influence the taste of wine. The following are some of the things I encountered:
Although it has over 4000 data points, the dataset is heavily concentrated on wines of average quality.
While plotting graphs, I found that most of the variables had quite a few outliers, which I handled manually by adjusting the axis ranges.
For the univariate analysis, I plotted histograms for all the factors to visualize their counts, medians and means.
During the bivariate analysis, all of the variables were compared against quality. Of these, alcohol, pH, chlorides and density stood out the most. This initial analysis was quick to show that alcohol was one of the crucial reasons behind good ratings, and it also revealed the relationship between alcohol level, residual.sugar and density.
In the multivariate analysis, I plotted some of the variables against the alcohol level by the quality of wine. In this section, I found that density and residual.sugar might not have any effect on the quality of wine.
In the beginning, I was expecting to find several attributes and the extent to which they affect the quality of the wine, but I wasn’t able to find other variables that influenced the wine quality as much as the level of alcohol did.
It might be interesting to analyze the red wine data separately or alongside the white wine data. Maybe I will be able to find how some of the chemical components behave differently in the two kinds of wine and how the quality is affected in turn.